NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Speech-driven animation with meaningful behaviors

https://doi.org/10.1016/j.specom.2019.04.005

Sadoughi, Najmeh; Busso, Carlos (July 2019, Speech Communication)

Full Text Available
Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks

https://doi.org/10.1109/TAFFC.2019.2916031

Sadoughi, Najmeh; Busso, Carlos (May 2019, IEEE Transactions on Affective Computing)
null (Ed.)
Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents(VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN(CSG), which learns the relationship between emotion, lexical content and lip movements in a principled manner. This model uses a set of spectral and emotional speech features directly extracted from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in terms of objective and subjective evaluations. When the target emotion is known, we propose to create emotionally dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted), or adding emotional conditions as the input of the model(CSG-Emo-Aware). Objective evaluations of these models show improvements for the CSG-Emo-Adapted compared with the CSG model, as the trajectory sequences are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness.
more » « less
Full Text Available
Expressive Speech-Driven Lip Movements with Multitask Learning

https://doi.org/10.1109/FG.2018.00066

Sadoughi, Najmeh; Busso, Carlos (May 2018, IEEE Conference on Automatic Face and Gesture Recognition (FG 2018))

The orofacial area conveys a range of information, including speech articulation and emotions. These two factors add constraints to the facial movements, creating non-trivial integrations and interplays. To generate more expressive and naturalistic movements for conversational agents (CAs) the relationship between these factors should be carefully modeled. Data-driven models are more appropriate for this task than rule-based systems. This paper provides two deep learning speech-driven structures to integrate speech articulation and emotional cues. The proposed approaches rely on multitask learning (MTL) strategies, where related secondary tasks are jointly solved when synthesizing orofacial movements. In particular, we evaluate emotion recognition and viseme recognition as secondary tasks. The approach creates shared representations that generate behaviors that not only are closer to the original orofacial movements, but also are perceived more natural than the results from single task learning.
more » « less
Full Text Available
Novel Realizations of Speech-driven Head Movements with Generative Adversarial Networks

Sadoughi, Najmeh and (January 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018))

Head movement is an integral part of face-to-face communications. It is important to investigate methodologies to generate naturalistic movements for conversational agents (CAs). The predominant method for head movement generation is using rules based on the meaning of the message. However, the variations of head movements by these methods are bounded by the predefined dictionary of gestures. Speech-driven methods offer an alternative approach, learning the relationship between speech and head movements from real recordings. However, previous studies do not generate novel realizations for a repeated speech signal. Conditional generative adversarial network (GAN) provides a framework to generate multiple realizations of head movements for each speech segment by sampling from a conditioned distribution. We build a conditional GAN with bidirectional long-short term memory (BLSTM), which is suitable for capturing the long-short term dependencies of time- continuous signals. This model learns the distribution of head movements conditioned on speech prosodic features. We compare this model with a dynamic Bayesian network (DBN) and BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation. The objective evaluations and subjective evaluations of the results showed better performance for the condi- tional GAN model compared with these baseline systems.
more » « less
Full Text Available
Meaningful head movements driven by emotional synthetic speech

https://doi.org/10.1016/j.specom.2017.07.004

Sadoughi, Najmeh; Liu, Yang; Busso, Carlos (December 2017, Speech Communication)

Full Text Available
Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory

https://doi.org/10.1007/978-3-319-67401-8_49

Sadoughi, Najmeh; and Busso, Carlos (January 2017, International Conference on Intelligent Virtual Agents (IVA 2017))

Thefaceconveysablendofverbalandnonverbalinformation playing an important role in daily interaction. While speech articulation mostly affects the orofacial areas, emotional behaviors are externalized across the entire face. Considering the relation between verbal and non-verbal behaviors is important to create naturalistic facial movements for conversational agents (CAs). Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between the movements that have to be taken into account. These relationships are ignored when facial movements across the face are sep- arately generated. This paper proposes to create speech-driven models that jointly capture the relationship not only between speech and facial movements, but also across facial movements. The input to the models are features extracted from speech that convey the verbal and emotional states of the speakers. We build our models with bidirectional long-short term memory (BLSTM) units which are shown to be very successful in modeling dependencies for sequential data. The objective and subjective evaluations of the results demonstrate the benefits of joint modeling of facial regions using this framework.
more » « less
Full Text Available

Search for: All records